Abstract In contrast to many other scientific disciplines, computer science
considers conference publications. Conferences have the advantage of providing
fast publication of papers and of bringing researchers together to present and
discuss their papers with peers. Previous work on knowledge mapping focused on
the map of all sciences or of a particular domain based on the ISI-published JCR
(Journal Citation Report). Although this data covers most of the important
journals, it lacks computer science conference and workshop proceedings. That
results in an imprecise and incomplete analysis of computer science knowledge.
This paper presents an analysis of the computer science knowledge network
constructed from all types of publications, aiming at providing a complete view
of computer science research. Based on the combination of two important digital
libraries (DBLP and CiteSeerX), we study the knowledge network created at the
journal/conference level using citation linkage, to identify the development of
sub-disciplines. We investigate the collaborative and citation behavior of
journals/conferences by analyzing the properties of their co-authorship and
citation subgraphs. The paper draws several important conclusions. First,
conferences constitute social structures that shape computer science knowledge.
Second, computer science is becoming more interdisciplinary. Third, experts are
the key success factor for the sustainability of journals/conferences.

1 Introduction



Recent studies on knowledge mapping in scientometrics are concerned with
building, visualizing and qualitatively analyzing the knowledge networks of
sciences [5llll l26l[33]. Similar to a geographical map, the knowledge network
of sciences, or the map of sciences, gives us an insight into the structure of
science. It can be used to visually identify major areas of science, their
similarity and their interconnectedness. Methods developed in bibliometrics and
scientometrics, such as citation analysis, content analysis and a recently
proposed method based on clickstream data [2], are commonly used in this domain.

Computer science is a fast-changing research field. Unlike other disciplines,
where the academic standard is to publish in journals, computer science also
considers conference publications. Previous work on knowledge mapping typically
focused on single disciplines [44, 10, 30, 4] or on the whole of science
[5] [21] [2], based on the analysis of massive citation data such as the Journal
Citation Report (JCR), Science Citation Index (SCI), Science Citation Index
Expanded (SCIE) and Social Science Citation Index (SSCI), published by Thomson
Scientific (TS, formerly ISI). Those datasets cover most of the important
journals of science, but they do not contain computer science conference and
workshop proceedings. That makes any attempt to map computer science knowledge
either imprecise or limited to small fields.

With the recent availability of large-scale citation indexes from digital
libraries in computer science such as ACM Portal, IEEE Xplore, DBLP and
CiteSeerX, it is possible to study the relationship between publication venues
and provide a more precise and complete view of today's computer science
research landscape at both local and global scale. In this paper (some of the
results were published in an earlier conference paper [39]), we study the
structure of the knowledge network and the publication culture in computer
science. Using the combination of two large and important digital libraries in
computer science, DBLP and CiteSeerX, we build a so-called knowledge map of
computer science and provide a comprehensive visualization which allows us to
explore its macro structure and its development over time. To gain an insight
into the collaborative and citation behavior in computer science, we investigate
the graphical features of the citation and collaboration subgraphs of
journals/conferences. One of our main findings is that conferences constitute
social structures that shape computer science knowledge. By analyzing the
combined knowledge network of journal and conference publications, we are able
to identify clusters (or sub-disciplines) and trace their development, which is
not possible by analyzing journals only. We also find that computer science
publications are very heterogeneous and that the field is becoming more
interdisciplinary, as each sub-discipline tends to connect to many other
sub-disciplines. Finally, there is a connection between the local structure of
the citation and collaboration subgraphs of journals/conferences and their
impact. On the one hand, high impact journals/conferences successfully build a
core topic and attract contributions from the research community. On the other
hand, experts are the key success factor for maintaining and cultivating the
community of journals/conferences (hereafter called venues).

The paper is organized as follows. In Section 2, we briefly survey the related
work. In Section 3, we discuss the role of conferences in computer science. In
Section 4, we describe the data set used in our study. In Section 5, we present
the creation of the networks used in our study. In Section 6, we discuss the
network visualization. In Section 7, we discuss the development of
sub-disciplines in computer science. In Section 8, we present the venue ranking
using SNA metrics. In Section 9, we present our analysis of the properties of
venues' subgraphs and their relation to the impact of venues. The paper finishes
with some conclusions and our directions for future research.



2 Related Work 

Social network analysis and visual analytics have been applied to represent
knowledge [47] and to detect communities and hierarchical structures in dynamic
networks [14][29]. In scientometrics, knowledge maps have been generated from
citation data to visualize the relationship between scholarly publications or
disciplines. Early work on mapping journals focused on single disciplines.
Morris [33] explored the interdisciplinary nature of medical informatics and its
internal structure using inter-citation and co-citation analysis. A combination
of the SCI and SSCI data was used in this study. McCain [30] performed a
co-citation analysis for journals in neural network research. Cluster analysis,
principal component analysis and multidimensional scaling (MDS) maps were used
to identify the main research areas. Regarding computer science, Ding [10]
studied the relationship between journals in the information retrieval area
using the same techniques. Based on the SciSearch database, Tsay [44] mapped
semiconductor literature using co-citation analysis. The datasets used in these
studies were rather small, ranging from tens to several hundred journals. In
more recent work, Boyack [4] mapped the structure and evolution of chemistry
research over a 30-year time frame. Based on a general map generated from the
combined SCIE and SSCI from 2002, he assigned journals to clusters using
inter-citation counts. Journals were assigned to the chemistry domains using JCR
categories. Then, maps of chemistry at different time periods and at domain
level were generated. The maps show many changes that have taken place over the
30 years of development of chemistry research.

Recently, several maps based on large-scale digital libraries have been
published. ISI has published journal citation reports for many years. This
dataset allows for generating a map of all sciences. Leydesdorff used the 2001
JCR dataset to map 5,748 journals from the SCI [26] and 1,682 journals from the
SSCI [23] in two separate studies. In those studies, Leydesdorff used the
Pearson correlation on citation counts as the edge weight and a progressively
lowered threshold to find the clusters. These clusters can be considered as
disciplines or sub-disciplines. Moya-Anegon et al. [34] created category maps
using documents with a Spanish address and ISI categories. The high level map
shows the relative positions, sizes and relationships between 25 broad
categories of science in Spain. Boyack [5] combined SCIE and SSCI from 2000 and
generated maps of 7,121 journals. The main objective of this study was to
evaluate the accuracy of maps using eight different inter-citation and
co-citation similarity measures.



Several studies have applied SNA measures to derive useful information from
knowledge maps. Leydesdorff [28] used the combination of SCIE and SSCI, and
generated centrality measures (betweenness, closeness and degree centrality).
These measures were analyzed in both global (the entire digital library) and
local (a small set of journals where citing is above a certain threshold)
environments. Bollen et al. [2] generated maps of science based on clickstream
data logged by six web portals (Thomson Scientific, Elsevier, JSTOR, Ingenta,
University of Texas and California State University). They validated the
structure of the maps by two SNA measures: betweenness centrality [46] and
PageRank [6]. In another study, Bollen [3] performed a principal component
analysis on 39 scientific impact factors, including four SNA factors (degree
centrality, closeness centrality, betweenness centrality and PageRank).

Regarding research on the performance of individuals and their local social
network structures, Shi et al. [JT] studied the citation projection graphs of
publications in different disciplines, including natural science, social science
and computer science, to understand their citation behaviors. Using several
social network analysis measures, they identified idiosyncratic citers,
within-community citers and brokerage citers. They found that there are
significant differences in how high, low and medium impact papers position their
citations. There are also other studies on the optimal network structure for the
performance of individuals [23], the benefits of communities in fostering trust
and facilitating the enforcement of social norms and a common culture [9], and
the benefits of structural holes and weak ties in accessing new information and
ideas [15].

3 The Role of Conferences in Computer Science Research 

The history of computer science can be traced back to 1936, with the invention
of the Turing machine. Until the early 1970s, journals were the main publication
outlet. The Journal of Symbolic Logic (founded in 1936), IEEE Transactions on
Information Theory (1953), Journal of the ACM (1954), Information and
Computation (1957) and Communications of the ACM (CACM) (1959) are probably the
oldest journals in computer science. In the late 1960s and early 1970s, some
conferences emerged. IFIP Congress (1962), SYMSAC (Symposium on Symbolic and
Algebraic Computation) (1966), the ACM Symposium on Operating Systems Principles
(SOSP) (1967), International Joint Conference on Artificial Intelligence (IJCAI)
(1969), Architecture of Computing Systems (ARCS) (1970), International
Colloquium on Automata, Languages and Programming (ICALP) (1972) and Symposium
on Principles of Programming Languages (POPL) (1973) are some examples of the
earliest conference series. Since the early 1980s, conferences have had a
dominant presence in computer science. According to the DBLP digital library, as
of 2010 there are 2716 conference series and 774 journals.

In 2009 and 2010, a dozen articles, letters and blog entries discussed the role
of conferences and the quality and impact of conference publications
[45][32][11]. In [32], Menczer argues for abolishing conference proceedings
altogether: submissions should instead go to journals, which would then receive
more and better papers. The impact and quality of conference publications are
also questioned, mainly due to the review process. Every conference has a desire
to be "competitive", and reducing the acceptance rate is an easy way to achieve
that. The great papers are always accepted and the worst papers mostly get
rejected, but the problem lies with the vast majority of papers landing in the
middle. That leads to an emphasis on safe papers (incremental and technical)
over those that explore new models and research directions outside the
established core areas of the conferences. Nevertheless, a recent study by J.
Chen and J. Konstan [7] shows that within ACM, papers in highly selective
conferences are cited at a rate comparable to or greater than ACM transactions
and journals. Freyne et al. demonstrate that papers in leading conferences match
the impact of papers in mid-ranking journals and surpass the impact of papers in
journals in the bottom half of the Thomson Reuters rankings.

Why have conferences become an important outlet in computer science? In [11], L.
Fortnow gave a short history of computer science conferences and the reasons why
computer science holds conferences. The fundamental reason is that the quick
development of the field requires rapid review and distribution of results. A
complete journal publishing decision takes at least one year, compared to 6
months for publishing in a conference. That delay is unacceptable for such a
fast-changing field. Secondly, conferences bring the community together to
disseminate new research and results, to network and to discuss the issues. That
rarely happens in journals, where the only possible communication is between
reviewers, editorial board and authors in the review process. Lastly, with the
tremendous continual growth in computer science, there are too many papers to
publish, more than archival journals alone can handle.

Our work is based on the above intuitions. We show that an analysis of journals
only cannot fully capture the characteristics and development of computer
science research, since focusing exclusively on journal papers misses many
significant papers published by conferences. We further show that conferences
facilitate communication and build a community among participants.

4 Data Collection 

The dataset used in our study is the combination of the DBLP and CiteSeerX
digital libraries. We chose them because they cover most sub-disciplines, while
IEEE Xplore and ACM Portal cover only IEEE and ACM journals and conference
proceedings. We retrieve the publication list of journals/conferences from DBLP.
Unfortunately, DBLP does not record citations. Therefore, we use CiteSeerX to
fill in the citation lists of publications in DBLP.

DBLP data was downloaded in July 2009 and consists of 788,259 author names,
1,226,412 publications and 3,490 venues. At the same time, we obtained CiteSeerX
data by first downloading the OAI (Open Archives Initiative) dataset using the
OAIHarvester API. Since the OAI dataset contains only references between
publications which are stored in CiteSeerX (with PDF documents), we continued to
crawl XML documents from the CiteSeerX site to obtain the full citation list for
each publication. Overall, we had complete CiteSeerX data with 7,385,652
publications (including publications in reference lists), 22,735,140 references
and over 4 million author names.

Naming is a problem that many digital libraries face, because one author may
have several names (synonyms) or several authors may share the same name
(homonyms). For example, in DBLP we can find seven authors with the name Chen
Li. Consequently, several techniques have been developed for the naming problem
in digital libraries |f 6ll38l[2ill43]. In our analysis, we rely on the
approaches that are implemented in CiteSeerX [1] and DBLP [25] and assume that
the authors in these databases are correctly identified.

Fig. 1: Citation distribution

We combined DBLP and CiteSeerX using a simple technique called canopy clustering
[31]. The basic idea is to use a cheap comparison metric to group records into
overlapping clusters called canopies. After that, records in the same cluster
are compared using more expensive (and more accurate) similarity measures. We
employed this idea to solve our problem. Firstly, publications in DBLP and
CiteSeerX are clustered using the last names of authors. It is debatable whether
the last names of authors give us the correct clusters, since one name can be
expressed differently (e.g. Michael Ley vs. Ley Michael). However, in most cases
author names of the same papers are presented in the same way in both digital
libraries. In the second step, we used two similarity metrics to compare paper
titles in each cluster: the less expensive Jaccard similarity to filter out
papers which are clearly unmatched, and the more expensive Smith-Waterman
distance to correctly identify pairs of matched papers. The process was
implemented in Java using the SecondString library and an Oracle database.
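
The two-stage matching described above can be illustrated with a small Python sketch. This is not the Java implementation used in the study: the record layout (dictionaries with `id`, `authors` and `title`), the function names, the thresholds `jaccard_min` and `sw_min`, and the scoring parameters of the local alignment are all assumptions chosen for illustration.

```python
from collections import defaultdict


def last_names(authors):
    """Cheap canopy key: lower-cased final token of each author name."""
    return {a.split()[-1].lower() for a in authors if a.split()}


def jaccard(a, b):
    """Jaccard similarity between the word sets of two titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, normalised by the best possible score."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,
                prev[j - 1] + (match if ca == cb else mismatch),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best / (match * min(len(a), len(b)))


def match_records(dblp, citeseer, jaccard_min=0.5, sw_min=0.9):
    """Return (dblp_id, citeseer_id) pairs judged to be the same paper."""
    # Stage 1: canopies keyed by author last name (records may overlap).
    canopies = defaultdict(list)
    for rec in citeseer:
        for name in last_names(rec["authors"]):
            canopies[name].append(rec)
    matches = set()
    for rec in dblp:
        seen = set()
        for name in last_names(rec["authors"]):
            for cand in canopies[name]:
                if cand["id"] in seen:
                    continue
                seen.add(cand["id"])
                # Stage 2a: cheap filter on title word overlap.
                if jaccard(rec["title"], cand["title"]) < jaccard_min:
                    continue
                # Stage 2b: expensive character-level local alignment.
                if smith_waterman(rec["title"], cand["title"]) >= sw_min:
                    matches.add((rec["id"], cand["id"]))
    return matches
```

The cheap Jaccard test runs on every pair inside a canopy, while the quadratic-time alignment runs only on the pairs that survive it, which is what keeps the approach affordable on millions of records.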

Overall, the matching algorithm gave us 864,097 pairs of matched publications,
meaning that about 70% of the publications in DBLP were matched to publications
in CiteSeerX. On average, each venue cites others 2306 times and is cited 2037
times. The distribution of the citations over the years is given in Fig. [1],
where the numbers of citations in 2009 and 2010 are low simply because new
publications had not yet been crawled by CiteSeerX. It is not known whether this
result reflects the real coverage of DBLP and CiteSeerX. However, in our
experience many publications in CiteSeerX are not indexed in DBLP. The reason is
that DBLP does not index some publication types such as pre-prints, in-prints,
technical reports and letters, and it covers a limited number of PhD theses,
master theses and books. That does not affect our analysis since we focus on
journal and conference publications. On the other hand, not all publications in
DBLP are indexed by CiteSeerX. If a publication is not online and public, it
will not be crawled by CiteSeerX.



5 Networks Creation 

We created two networks using the dataset described above: one knowledge network
K based on the relatedness of venues and one citation network F based on
citation counts. We proceeded as follows.

Bibliography coupling counts were calculated at the publication level on the 
whole digital libraries. These counts were aggregated at the venue level (3,490 
venues), giving us the bibliography coupling counts between pairs of venues. Of 
3,490 venues, 303 venues which have no citations were excluded. The result is a 
symmetric bibliography coupling frequency matrix V with venues as columns and 
rows. Based on this matrix, we created the knowledge network K by normalizing 
bibliography coupling counts using cosine similarity as suggested in [21] , in which 
the full version of the cosine index was used. Concretely, the cosine similarity
between a pair of venues is computed as:

\[
C_{ij} = \frac{B_i \cdot B_j}{\lVert B_i \rVert \, \lVert B_j \rVert}
       = \frac{\sum_{k=1}^{n} B_{i,k} \, B_{j,k}}
              {\sqrt{\sum_{k=1}^{n} B_{i,k}^{2}} \; \sqrt{\sum_{k=1}^{n} B_{j,k}^{2}}}
\]

where $C_{ij}$ is the cosine similarity between venues $V_i$ and $V_j$, $B_i$ is
the vector representation of the list of citations from venue $V_i$ to all
publications, $n$ is the number of publications in the database, and $B_{i,k}$
is the number of times venue $V_i$ cites publication $k$. The resulting network
consists of 1,930,471 undirected weighted edges. The 120 venues whose cosine
similarity to all others equals zero were not included in the network.
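
As a small illustration of the cosine normalization described above, the following Python sketch computes the similarity between two venues from the lists of publications they cite. The function names and the sparse `Counter` representation of the citation vectors are assumptions made for this example, not part of the original implementation.

```python
import math
from collections import Counter


def venue_vector(cited_publications):
    """Count how often a venue cites each publication (its citation vector)."""
    return Counter(cited_publications)


def cosine(b_i, b_j):
    """Cosine similarity between two sparse citation-count vectors."""
    dot = sum(count * b_j[pub] for pub, count in b_i.items() if pub in b_j)
    norm_i = math.sqrt(sum(c * c for c in b_i.values()))
    norm_j = math.sqrt(sum(c * c for c in b_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a venue without citations is similar to nothing
    return dot / (norm_i * norm_j)


# Hypothetical toy data: publication identifiers cited by two venues.
venue_a = venue_vector(["p1", "p1", "p2"])
venue_b = venue_vector(["p1", "p3"])
similarity = cosine(venue_a, venue_b)  # 2 / sqrt(5 * 2)
```

Only publications cited by both venues contribute to the numerator, so two venues with disjoint reference lists receive a similarity of zero regardless of how many papers they publish.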

The citation network F is formed by counting the inter-citations between venues.
Nodes are venues and there is an edge from venue Vi to venue Vj if Vi cited Vj,
weighted by the number of those citations. The network contains 351,756 directed
edges, resulting in a network density of 3.5%.

To reduce noise in the visualization and analysis, we consider only the most
relevant connections between venues. For the knowledge network K, we eliminated
all connections with a cosine similarity smaller than 0.1, obtaining the reduced
network K' whose connection cosine similarities lie in the range [0.1, 1.0].
Although this threshold is arbitrary, the network K' retains 1,739 nodes and
9,637 connections, corresponding to 57% of the nodes and 0.5% of the edges of
the original network. For the citation network F, the same procedure was
performed, keeping only the connections whose citation counts were greater than
50. The remaining network F' contains 1,060 venues and 9,964 connections,
corresponding to 33% of the nodes and 2.8% of the edges of the original network.
A summary of the network properties is given in Table [1].
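
The reduction step can be sketched in a few lines of Python. The edge dictionary layout and the function names below are assumptions for illustration; the density helper uses the directed convention of n(n-1) possible edges, which reproduces the 3.5% reported for F.

```python
def reduce_network(edges, keep):
    """Keep edges whose weight passes `keep`; report what survives.

    edges: non-empty {(source, target): weight} mapping.
    Returns the kept edges plus the retained fractions of nodes and edges.
    """
    kept = {pair: w for pair, w in edges.items() if keep(w)}
    nodes_before = {v for pair in edges for v in pair}
    nodes_after = {v for pair in kept for v in pair}
    return kept, len(nodes_after) / len(nodes_before), len(kept) / len(edges)


def density(n_nodes, n_edges):
    """Edge count relative to the n(n-1) ordered node pairs."""
    return n_edges / (n_nodes * (n_nodes - 1))


# Hypothetical toy knowledge network, thresholded at cosine >= 0.1.
toy = {("A", "B"): 0.4, ("A", "C"): 0.05, ("B", "D"): 0.1}
kept, node_share, edge_share = reduce_network(toy, lambda w: w >= 0.1)

# Density of the full citation network F from the counts in the text.
density_f = density(3187, 351756)
```

For the citation network the same function would be called with `lambda w: w > 50`, matching the strict threshold on citation counts.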



Table 1: Networks Summary

Property            F          F'        K            K'
Nodes               3,187      1,060     3,067        1,739
Edges               351,756    9,964     1,930,471    9,637
Components          1          6         1            71
Density             3.5%       0.89%     20%          0.3%
Clustering coef.    0.569      0.764     0.786        0.629



The reason for creating two networks is as follows. Because of the diversity of
publication types and the interdisciplinary nature of computer science,
publications often refer to publications (e.g. preprints, letters) which may not
be published by any journal, conference or workshop. The references also point
to publications in other disciplines. For example, many papers on SNA cite the
work done by Newman and Barabasi, which is published in science journals (Phys.
Rev. Letters or Nature). That should be considered when calculating the
similarity between venues. Therefore, we computed the cosine similarity on the
complete list of references at the paper level, then aggregated at the venue
level to create the knowledge network. However, to study the information
diffusion and the impact of venues in the domain, we need only the citation
counts between the venues themselves. Accordingly, the citation network was
created based on the inter-citation counts between venues.

6 Knowledge Network Visualization 

We visualized the knowledge network K' using the smart organic layout
implemented in the yFiles library, based on the force-directed paradigm [13].
The visualization is given in Fig. [2], where venues are represented as circles
with the diameter denoting the number of publications, and the thickness of the
connections denotes the cosine similarity. Nodes are colored according to their
assignment to domain categories in Microsoft Academic Search (Libra). White
nodes are uncategorized venues. Libra assigns 2637 venues to 23 domains, so 430
venues in our database remain uncategorized. We also noted that some venues are
assigned to multiple domains. For those, we randomly chose one of the assigned
domains. Fig. [3] gives the visualization of the knowledge network using
journals only, which allows us to compare the visual structures of the two
networks.

Any interpretation of the visual structure of the knowledge network in Fig. [2]
has to take into account the following considerations. Firstly, different runs
of the force-directed algorithm can converge on different visualizations of the
knowledge network. Fig. [2] is not the only or best possible visualization. It
was selected because it represents a clear visualization of the connections
between venues in the knowledge network and its main structural features were
stable across many runs of the visualization algorithm. Secondly, the
force-directed algorithm groups together venues that are strongly connected in
the knowledge network. The appearance of clusters thus depends on the weight of
the connections in the knowledge network and is not an artifact of the
visualization. Finally, the exact geometric coordinates of journals/conferences
and clusters vary depending on the visualization algorithm and are thus
considered artifacts of the visualization.

Fig. 2: The combined knowledge network (giant component)

Fig. [2] shows a clear cluster structure in which venues in the same domain are
placed in clusters. In contrast, the network of journals only (Fig. [3]) is
rather "unordered" and one cannot identify sub-disciplines from this network. In
Fig. [2], the large and coherent clusters are algorithms and theory, artificial
intelligence, software engineering, security and privacy, distributed and
parallel computing, networks and communications, computer graphics, computer
vision, databases, data mining and machine learning. They cover most of the core
topics of computer science. Some domains do not have their own clusters. Venues
in those domains are placed in the same clusters as venues from closely related
domains. For example, data mining and machine learning are combined in one
cluster; information retrieval sticks to databases; natural language and speech
processing is a sub-group of the artificial intelligence cluster, etc. That
result reflects the hierarchical structure of the domain classification.




Fig. 3: The knowledge network using journals only 



Connections between venues in the network cross multiple domains. Dominating the
middle of the network are venues in algorithms and theory. This domain is
connected to many other domains on the rim of the wheel. The second dominant
domain at the center is databases. In clockwise order, starting at 12 o'clock,
databases is tightly connected to information retrieval, data mining and machine
learning (1 o'clock), artificial intelligence (2 o'clock), as well as software
engineering (the green color, at 3 to 4 o'clock). Computer graphics connects to
computer vision, multimedia and human-computer interaction studies. We can also
easily identify the cluster of bioinformatics, which has connections to
artificial intelligence, data mining and machine learning. At the bottom of the
wheel, there is a mixed cluster of venues from hardware and architecture,
real-time and embedded systems, and security and privacy. This cluster connects
strongly to software engineering and distributed computing.

Although the visualization of the knowledge network at the venue level shows a
clear cluster structure, the structure is easier to read in a visualization at
the cluster level. During the network reduction process, many venues were
excluded. To make the visualization at the cluster level more precise, we
proceed as follows:

- The knowledge network K' is clustered using a density-based clustering
  algorithm proposed by Newman and Clauset [35][8]. The basic idea of the
  algorithm is to find a division of the network into clusters within which the
  network connections are dense, but between which they are sparser. To measure
  the quality of a division, the modularity Q [36][37] is used. In our case, the
  algorithm gives us 92 clusters with the modularity Q = 0.771.

- Using the bibliography coupling frequency matrix V where columns and rows are
  venues, the counts were aggregated to the cluster level for the venues which
  were assigned to clusters, thus giving us the bibliography coupling counts
  between unclustered venues and clusters. That results in a bibliography
  coupling frequency matrix V' with venues and clusters in both columns and
  rows. We calculate the cosine index between the 1328 unclustered venues and
  the 92 clusters, and assign each unclustered venue to the cluster with which
  it has the highest cosine value.

- After that, the cosine index is re-computed for pairs of clusters in the same
  way as we did for venues.
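
To make the quality measure used in the first step concrete, the following Python sketch evaluates the modularity Q of a given division of an undirected weighted network. It only scores a partition; it does not perform the search for a good partition that the clustering algorithm carries out. The edge dictionary layout and the names `modularity` and `cluster_of` are assumptions for this example.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: {(u, v): weight} with each undirected edge listed once.
    cluster_of: {node: cluster label}.
    """
    total = sum(edges.values())
    inside = defaultdict(float)  # edge weight falling inside each cluster
    degree = defaultdict(float)  # summed weighted degree of each cluster
    for (u, v), w in edges.items():
        degree[cluster_of[u]] += w
        degree[cluster_of[v]] += w
        if cluster_of[u] == cluster_of[v]:
            inside[cluster_of[u]] += w
    # Observed share of internal weight minus the share expected at random.
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(toy_edges, two_clusters)  # 5/14
```

Values of Q close to zero indicate a division no better than random, so the reported Q = 0.771 corresponds to a strongly clustered network.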

Fig. [4] gives the visualization at the cluster level, where clusters are
squares with the size denoting the number of venues and the weight of the
connection between clusters is the cosine similarity. Clusters are colored using
the same color scheme as in Fig. [2]. The colors show the fraction of domain
venues in clusters. To prevent clutter, for each cluster we retain the 2
strongest outbound relationships. The network is manually labeled based on the
assignment of clusters to particular domains.
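
The pruning rule of keeping the two strongest outbound relationships per cluster can be sketched as follows. The nested dictionary layout and the function name are assumptions for illustration, and the cluster labels in the example are hypothetical.

```python
import heapq


def strongest_links(similarity, k=2):
    """For each cluster keep only its k highest-weighted outgoing links."""
    return {
        source: heapq.nlargest(k, targets.items(), key=lambda item: item[1])
        for source, targets in similarity.items()
    }


# Hypothetical cosine similarities between cluster labels.
cluster_similarity = {
    "AI": {"DB": 0.3, "ML": 0.8, "HCI": 0.1},
    "DB": {"AI": 0.3},
}
pruned = strongest_links(cluster_similarity)
```

Because the selection is made per source cluster, a link may survive in one direction only, which is why the retained relationships are described as outbound.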

The network in Fig. [4] can be interpreted as follows. In general, the
appearance of the network is similar to the network in Fig. [2]. Most domains
are assigned to more than one cluster, in which they dominate or share the
"power" with other related fields. The exceptions are graphics and
bioinformatics, which are each assigned to a single cluster. Large clusters are
composed of several closely related domains (except for the large clusters of
algorithms and theory, and software engineering, where the venues of these
fields dominate the clusters). For example, one cluster in the upper half of the
diagram contains machine learning, AI, databases, data mining, information
retrieval and the world wide web. These fields appear to be very active research
areas, with one large cluster and many small ones closely connected to each
other. AI is the most interdisciplinary area. Venues in this field are
distributed over multiple clusters which have many connections to other areas
such as databases, data mining, information retrieval, machine learning, WWW,
software engineering, algorithms and theory, bioinformatics and HCI. Computer
vision, multimedia and graphics are quite marginal topics which have
relationships only to machine learning.

7 The Evolution of the Knowledge Network 

The visualization given in Fig. [2] is useful for observing the recent
organization of computer science knowledge. The interesting question now is how
computer science knowledge arrived at this stage. In particular, we would like
to see the development of the research areas: how new fields emerge and develop
over time, how venues come together to form clusters (sub-disciplines) and how
they split into sub-clusters, how new venues are connected to the existing
venues and clusters, and how the strength of the connections between clusters
increases to form the "shape" of computer science knowledge.

Fig. 4: The knowledge network at a cluster level

To answer these questions, we visualize the knowledge network at different time
points, from 1990 to 2005, at 5-year intervals, using the same techniques as
presented in Section 5 and Section 6. To compute the similarity between venues
at a certain time, we consider only the papers published from this time point
backwards. The scales for cosine similarities (the thickness of connections) and
venue size (node size) have been kept constant to enable easy inspection of the
changes. Note that the assignment of venues to sub-disciplines by Libra is not
perfect, so there are some misclassifications. To interpret the visualization,
we have to rely on both the clusters of venues grouped by the visualization
algorithm and the sub-discipline labels: in each cluster, if a sub-discipline
has a dominant number of venues, then this cluster represents that
sub-discipline.
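
The cumulative selection of papers for each time point can be sketched as below. The paper records with a `year` field and the function names are assumptions for illustration; the sketch also assumes that papers published in the snapshot year itself are included.

```python
def snapshot(papers, year):
    """Papers visible at `year`: everything published in that year or earlier."""
    return [p for p in papers if p["year"] <= year]


def snapshots(papers, years=(1990, 1995, 2000, 2005)):
    """Cumulative paper sets for each time point, keyed by year."""
    return {year: snapshot(papers, year) for year in years}


# Hypothetical records; only the publication year matters here.
papers = [
    {"id": "p1", "year": 1988},
    {"id": "p2", "year": 1995},
    {"id": "p3", "year": 2003},
]
by_year = snapshots(papers)
```

Each snapshot would then be passed through the same coupling, cosine and thresholding steps as the full dataset, so that the networks at different time points are directly comparable.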

The visualizations of the knowledge network in 1990, 1995, 2000 and 2005 are
given in Fig. 5, 6, 7 and 8. A close inspection of these figures and Fig. 2 reveals
many changes. In 1990, the knowledge network is not clearly clustered. Although
we can identify groups of venues in some sub-disciplines such as databases,
artificial intelligence, algorithms and theory, software engineering and programming
languages, and privacy and security, the venues in these domains are distributed over
several groups and the connections within these groups are very sparse (low density).
Some sub-disciplines are even separated into disconnected components (e.g. a group of
software engineering venues at the bottom left corner). In 1995 there are still
some disconnected groups, but venues start to come closer to form the core of
sub-disciplines. We can also observe the early connections between fields. At the
center, there is a large body of algorithms and theory which has many connections
to other large clusters such as software and programming languages, databases,
artificial intelligence, and distributed and parallel computing. Computer graphics (on
the right-hand side) starts a cluster and has connections to human-computer
interaction. Some other sub-disciplines emerge: machine learning and data mining
emerge from artificial intelligence, and networking separates from operating systems and




Fig. 6: The knowledge network in 1995 



distributed and parallel computing. Well-established venues (shown in Fig. 5) continue
to play the central role in their domains, such as VLDB, SIGMOD and TKDE
in databases, TSE and ICSE in software and programming languages, SIGCOMM
and INFOCOM in networking, and AAAI, AI and IJCAI in artificial intelligence.

We also observe these trends in 2000 and 2005, where sub-disciplines become
more organized and mature. The connections between sub-disciplines also become
clearer, reflecting the interdisciplinary nature of computer science research.
Sub-disciplines also start to separate, as we see in Fig. 7 and Fig. 8, where
artificial intelligence is divided into several clusters and bioinformatics emerges from
database research. However, merging seems to dominate the development trend,
Fig. 7: The knowledge network in 2000



where disconnected components of the network join the giant component. For
example, in 1990 there are several disconnected components of software engineering,
privacy and security, as well as other domains. In 1995, a big disconnected
component of software engineering joins the giant component. The component of
privacy and security stays disconnected, but becomes bigger. Then in 2000, that
component finally joins the giant component. It is interesting to connect these
observations to what actually happened at that time. For example, in 1996, the HTML
2.0 specification was maintained as a standard and in 1997, it became an international
standard (RFC 2070). That is the reason for the boom of the Internet with
many commercial software vendors and platforms, especially Internet Explorer




Fig. 8: The knowledge network in 2005 



developed for the Windows 95 system. Before that time, privacy and security was
quite an isolated research domain in computer science. However, with the increasing
use of the Internet, where people can exchange information quickly and freely,
security became one of the main concerns and attracted a lot of attention from
both industry and the research community. That could be the reason why in 1995
security and privacy research stays a disconnected component, but in 2000 it
connects to the giant component and is one of the large clusters.
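
The merging of disconnected components into the giant component can be traced with a simple connected-component computation per snapshot. The following is a minimal sketch under stated assumptions: the node labels, the edge lists and the helper name components are invented for illustration and do not reproduce the actual venue data.

```python
def components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

# Invented toy snapshots: labels stand for venues of a sub-discipline.
nodes = ["db1", "db2", "ai1", "se1", "se2", "sec1", "sec2"]
base = [("db1", "db2"), ("db2", "ai1"), ("se1", "se2"), ("sec1", "sec2")]
snapshots = {
    1990: base,
    1995: base + [("se1", "db1")],                    # software engineering joins
    2000: base + [("se1", "db1"), ("sec1", "ai1")],   # security joins
}
giant_sizes = {year: len(components(nodes, edges)[0])
               for year, edges in snapshots.items()}
print(giant_sizes)
```

Comparing the size of the largest component across snapshots shows at which time point a previously isolated group becomes part of the giant component.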

The visualizations in Fig. 5, 6, 7 and 8 reveal a lot of information, more than we can
describe here. Thus, we highlight the main changes and trends. Over time, the main
topics in computer science, including algorithms and theory, artificial intelligence,
databases, networking, and software engineering, develop consistently. Domains
become more and more interdisciplinary as they connect, to a greater or lesser
degree, to other domains or sub-domains. Fields start to split into sub-fields, though
merging dominates the development trend. New fields or sub-fields continuously emerge
from the existing sub-disciplines. With the growth of the Web, data mining and
information retrieval, emerging from database and artificial intelligence research, as well
as privacy and security are becoming more and more exciting fields with a lot of
conferences and journals.



8 Venue Ranking 

There are different metrics to evaluate the performance and prestige of individuals
and journals, such as citation count, H-index [17], and the impact factor* by the Institute
for Scientific Information (ISI), now part of Thomson Reuters. However, these
metrics are controversial [40][22] and it turns out that no single metric
can judiciously evaluate the impact of a journal or a scientist. A metric should
be used in combination with other metrics to fully and reliably assess the
performance of scientists, publications and venues.

We employ two social network measures, betweenness centrality and PageRank,
for venue ranking. These measures are not intended as a replacement for, but as a
complement to, the existing metrics. Given the assignment of venues to domains and
the citation network F' , we calculated each node's betweenness centrality and
PageRank to determine interdisciplinary and high-prestige venues, respectively. Using
PageRank has one advantage over the impact factor: PageRank ranks highly those
venues which are cited by other highly ranked venues, so new venues have a higher
impact when they are cited by well-known venues.

Betweenness centrality [36] of a venue V_k is defined as the number of shortest
paths in the network that pass through V_k and it is computed as follows:

$$B(V_k) = \sum_{i \neq j \neq k} \frac{P_{ij}(k)}{P_{ij}} \qquad (2)$$

where P_{ij} is the number of (weighted) shortest paths between venues V_i and V_j, and
P_{ij}(k) is the number of those shortest paths which go through venue V_k. A high
value of betweenness centrality indicates that a venue is a "gateway" which connects
a large number of venues and venue clusters. Venues with high betweenness
centrality values are often interdisciplinary. Table 2 gives the list of the top 30
centrality venues. They are indeed highly interdisciplinary venues. The first position is CORR
(Computing Research Repository) with a betweenness of 0.185. DBLP classifies it
as a journal, but in fact CORR is a repository to which researchers can submit
technical reports. CORR covers almost every topic of computer science. Papers
published in CORR are not peer reviewed; only the relatedness to the topic area
is checked. That is the reason for the appearance of CORR as a large venue in
the visualization (Fig. 5) and as a top interdisciplinary venue. Among others, AI,
machine learning, databases and the world wide web contribute ten venues to this
list. That confirms their interdisciplinary nature reflected in Fig. [H
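
To illustrate how such a betweenness score can be obtained, the following sketch implements Brandes' algorithm for a small unweighted, undirected graph. This is a simplification: the paper uses weighted shortest paths on the venue network, whereas the sketch counts hops only, and the graph, the node labels and the function name betweenness are assumptions made for illustration.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs.

    `adj` maps each node to a list of neighbours. Returns raw (unnormalized)
    betweenness, counting each unordered pair of end points once.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its end points.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical toy network: venue "b" is the only gateway between a, c and d.
adj = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
scores = betweenness(adj)
n = len(adj)
normalized = {v: b / ((n - 1) * (n - 2) / 2) for v, b in scores.items()}
print(scores, normalized)
```

In the toy network all shortest paths between the outer nodes pass through the central node, so it receives the maximum normalized score while the outer nodes receive zero.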

* http://thomsonreuters.com/products_services/science/academic/impact_factor/







Table 2: Top betweenness centrality venues 



Rank  Name          Type  Libra classification
1     CORR          J     Un-categorized
2     TCS           J     Algorithms and Theory
3     INFOCOM       C     Networks&Communications
4     AI            J     Artificial Intelligence
5     CSUR          J     Un-categorized
6     TC            J     Un-categorized
7     TSE           J     Software Engineering
8     JACM          J     Un-categorized
9     CACM          J     Un-categorized
10    CHI           C     Human-Computer Interaction
11    ML            J     Machine Learning
12    IJCAI         C     Artificial Intelligence
13    TOPLAS        J     Software Engineering
14    AAAI          C     Artificial Intelligence
15    PAMI          J     Un-categorized
16    ICRA          C     Artificial Intelligence
17    SIAMCOMP      J     Un-categorized
18    TPDS          J     Distributed&Parallel Computing
19    ICDE          C     Databases
20    WWW           C     World Wide Web
21    TKDE          J     Databases
22    CVPR          C     Computer Vision
23    ENTCS         J     Algorithms and Theory
24    VLDB          C     Databases
25    IPPS          C     Scientific Computing
26    ALGORITHMICA  J     Algorithms and Theory
27    ICDCS         C     Networks&Communications
28    CAV           C     Software Engineering
29    SIGGRAPH      C     Graphics
30    CN            J     Networks&Communications

The PageRank score of a venue is computed according to the PageRank algorithm 0. 
The algorithm iteratively calculates the PageRank score of a venue based on the 
score of its predecessors in the network as in the following equation. 

$$P(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{Pred}(V_i)} \frac{P(V_j)}{O(V_j)} \qquad (3)$$

where P(V_i) is the PageRank score of venue V_i, V_j ranges over the predecessors
Pred(V_i) of V_i and O(V_j) is the out-degree of V_j. Parameter d is the damping factor,
which is usually set to 0.85 in the literature. We note that the damping factor d models
the random Web surfer. Web surfing behavior is different from citing behavior, so the
appropriate value of d may be different in our case. We use the same value of d here
and keep this caveat in mind.
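
The iteration in Equation (3) can be sketched as follows. The function name pagerank, the fixed number of iterations and the three-venue toy network are assumptions made for illustration; the sketch uses the unweighted form of the equation and requires every venue to appear as a key of the input dictionary.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate P(V_i) = (1 - d) + d * sum(P(V_j) / O(V_j)) over predecessors V_j.

    `out_links` maps every venue to the list of venues it cites.
    """
    scores = {v: 1.0 for v in out_links}
    for _ in range(iterations):
        new_scores = {v: 1.0 - d for v in out_links}
        for source, targets in out_links.items():
            for target in targets:
                # Each cited venue receives an equal share of the citing venue's score.
                new_scores[target] += d * scores[source] / len(targets)
        scores = new_scores
    return scores

# Invented toy citation network between three hypothetical venues.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(out_links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {v: round(s, 3) for v, s in scores.items()})
```

Venue C ranks first in this toy network because it is cited by both other venues, and venue A follows because its single citation comes from the highly ranked C, which illustrates the advantage over plain citation counting mentioned above.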

The list of the 30 highest PageRank venues is given in Table 3, where column Type
denotes the type of venue (J for journal and C for conference/workshop). PageRank
favors venues that are well-connected to other well-connected venues. Surprisingly,
CORR is in sixteenth position though it mostly consists of technical reports. The
list in Table 3 contains not only journals, but also the leading conferences in the
fields. From the list, one can see well-known venues such as Communications of
the ACM (CACM), Journal of the ACM (JACM), Journal of Artificial Intelligence
(AI), SIAM Journal on Computing (SIAMCOMP) and Theoretical Computer
Science (TCS), as well as conferences in different fields such as SIGGRAPH,
AAAI, SOSP, SIGCOMM, POPL, VLDB, NIPS etc.



Table 3: Top PageRank venues 



Rank  Name      Type  Libra classification
1     CACM      J     Un-categorized
2     JACM      J     Un-categorized
3     AI        J     Artificial Intelligence
4     SIAMCOMP  J     Un-categorized
5     TCS       J     Algorithms and Theory
6     SIGGRAPH  C     Graphics
7     TSE       J     Software Engineering
8     JCSS      J     Un-categorized
9     AAAI      C     Artificial Intelligence
10    SOSP      C     Operating Systems
11    SIGCOMM   C     Networks&Communications
12    PAMI      J     Machine Learning
13    INFOCOM   C     Networks&Communications
14    IJCAI     C     Artificial Intelligence
15    POPL      C     Software Engineering
16    CORR      J     Un-categorized
17    IANDC     J     Un-categorized
18    TOCS      J     Un-categorized
19    ISCA      C     Hardware and Architecture
20    TC        J     Un-categorized
21    STOC      C     Un-categorized
22    VLDB      C     Databases
23    ML        J     Machine Learning
24    PLDI      C     Software Engineering
25    TOPLAS    J     Software Engineering
26    TON       J     Un-categorized
27    SODA      C     Algorithms and Theory
28    NIPS      C     Machine Learning
29    COMPUTER  J     Un-categorized
30    TIT       J     Algorithms and Theory



9 Understanding the Collaboration and Citation Behavior 

9.1 Venues Subgraphs 

To understand the collaboration and citation behavior of the communities of
venues, we study the properties of the co-authorship and citation subgraphs of
venues. We take all the papers published in a venue and extract its co-authorship
network. The resulting network consists only of the collaborations of the authors in the
venue. Note that two authors might collaborate with each other in other venues,
but might not collaborate in the venue under consideration. However, since we
investigate the collaborations of authors working on the topics of the venue and
how the venue maintains and cultivates these collaborations, it is not necessary
to consider the collaborations of these authors in other venues. To create citation
subgraphs, we take all the publications cited by papers published in a given venue,
project them on the underlying citation graph and extract the subgraph of citations
among these publications. Formally, we define the co-authorship subgraph
Ga = (A, E) of a venue as a graph where A is the set of authors who published
some papers in this venue and there is a connection e ∈ E between authors ai and
aj ∈ A if they wrote a paper published in this venue together. Similarly, we define
the citation subgraph of a venue Gc = (P, C), where P is the set of publications
cited by papers published in this venue and C is the set of citations among these
publications.
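
The two definitions above can be sketched in Python as follows. The paper records (dictionaries with id, venue and authors fields), the citation pairs and the function names are hypothetical and chosen only to make the construction of (A, E) and (P, C) concrete.

```python
from itertools import combinations

def coauthorship_subgraph(papers, venue):
    """Return (A, E): authors of `venue` and the pairs who co-wrote a paper there."""
    authors, edges = set(), set()
    for paper in papers:
        if paper["venue"] == venue:
            authors.update(paper["authors"])
            # One undirected edge per pair of co-authors of this paper.
            edges.update(frozenset(pair)
                         for pair in combinations(sorted(paper["authors"]), 2))
    return authors, edges

def citation_subgraph(papers, citations, venue):
    """Return (P, C): publications cited from `venue` and the citations among them."""
    in_venue = {p["id"] for p in papers if p["venue"] == venue}
    cited = {dst for src, dst in citations if src in in_venue}
    links = {(src, dst) for src, dst in citations if src in cited and dst in cited}
    return cited, links

# Invented toy data: two venues "V" and "W" with hypothetical authors.
papers = [
    {"id": "p1", "venue": "V", "authors": ["ann", "bob"]},
    {"id": "p2", "venue": "V", "authors": ["bob", "cat"]},
    {"id": "p3", "venue": "W", "authors": ["ann", "cat"]},
    {"id": "p4", "venue": "W", "authors": ["dan"]},
    {"id": "p5", "venue": "W", "authors": ["eve"]},
]
citations = [("p1", "p3"), ("p1", "p4"), ("p2", "p4"), ("p4", "p3"), ("p5", "p3")]

authors, edges = coauthorship_subgraph(papers, "V")
cited, links = citation_subgraph(papers, citations, "V")
print(sorted(authors), len(edges), sorted(cited), sorted(links))
```

Note that the collaboration between ann and cat is not part of the co-authorship subgraph of venue V, since it only took place in venue W, which matches the restriction discussed above.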

Given the co-authorship and citation subgraphs of venues, we then elaborate
a set of network metrics that characterize and describe their structure. To give
an idea about what types of networks we are trying to classify, let us take a look
at the example given in Fig. 9. Fig. 9a (type 1) shows a network that is sparsely
connected. The density of this network is rather low. In Fig. 9b (type 2), nodes are
clustered in small disconnected components. Fig. 9c (type 3) describes a network
where several small disconnected components come together to form a large connected
component. In Fig. 9d (type 4), there exists a dense, large component and
several small components connected to it. Intuitively, for citation subgraphs, network
type 2 represents venues whose citations are placed in unrelated
sub-disciplines, meaning that the venue is unfocused. In network type 3, different clusters
of papers that correspond to different sub-disciplines are cited but the connections
between these sub-disciplines are also identified. Network type 4 illustrates a focused
and interdisciplinary venue where the cited papers are clustered in one
large component that can be considered as the main theme of the venue, and
this component is connected to many other smaller components corresponding to
the related sub-disciplines.

We employ four network metrics [36] in order to distinguish the four types of 
network. For every venue, we use these four metrics to characterize the features of 
its citation and co-authorship subgraphs. The four metrics are defined as follows: 

— Density (M1): The density of a graph G = (V, E), where V is the set of vertices and
E is the set of edges, is defined (for an undirected graph; the factor 2 is dropped
for a directed graph) as:

$$D = \frac{2|E|}{|V|(|V| - 1)}$$

— Clustering coefficient (M2): The local clustering coefficient of a node v_i is
defined as follows:

$$C_i = \frac{\text{number of closed triads connected to } v_i}{\text{number of triples of vertices centered on } v_i}$$

The average local clustering coefficient is defined as

$$\bar{C} = \frac{1}{|V|} \sum_{v_i \in V} C_i$$

— Maximum betweenness (M3): the highest betweenness of the nodes in
G. The betweenness of a node v_i is defined as

$$B(v_i) = \sum_{j \neq i \neq k} \frac{\sigma_{v_i}(v_j, v_k)}{\sigma(v_j, v_k)}$$

where \sigma_{v_i}(v_j, v_k) is the number of shortest paths from node v_j to node v_k that
pass through v_i and \sigma(v_j, v_k) is the total number of shortest paths from v_j to
v_k. The betweenness may be normalized by dividing by the number of
pairs of vertices not including v_i, which is (n - 1)(n - 2) for directed graphs
and (n - 1)(n - 2)/2 for undirected graphs, where n is the number of vertices
in the network.

— Largest connected component (M4): the fraction of nodes in the largest
connected component.

Fig. 9: Network types: (a) Network Type 1, (b) Network Type 2, (c) Network Type 3, (d) Network Type 4
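
Three of the four metrics can be computed with a few lines of Python for an undirected graph stored as a dictionary of neighbour sets. This is a minimal sketch under stated assumptions: the function names and the toy graph are invented, the density uses the undirected form, and maximum betweenness (M3) is left out for brevity.

```python
def density(adj):
    """M1 for an undirected graph: 2|E| / (|V| (|V| - 1))."""
    n = len(adj)
    edge_count = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0

def average_clustering(adj):
    """M2: mean of the local clustering coefficients (0 for nodes of degree < 2)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the pairs of neighbours that are themselves connected.
        closed = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += closed / (k * (k - 1) / 2)
    return total / len(adj) if adj else 0.0

def largest_component_fraction(adj):
    """M4: share of nodes that lie in the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        best = max(best, size)
    return best / len(adj) if adj else 0.0

# Invented toy graph: a triangle a-b-c with a pendant node d, plus a separate pair e-f.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": {"f"}, "f": {"e"}}
print(density(adj), average_clustering(adj), largest_component_fraction(adj))
```

For the toy graph, five edges among six nodes give a low density, the triangle drives the clustering coefficient, and four of the six nodes lie in the largest component.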

To summarize, the four metrics allow us to differentiate the four types of network
based on the scheme in Table 4.



Table 4: Network types and properties 



Type    M1           M2        M3          M4
Type 1  Very low     Very low  Very low    Very low
Type 2  Low/Medium   High      Low         Medium
Type 3  Low/Medium   Medium    Low/Medium  High
Type 4  Medium/High  Medium    Very high   Very high

Fig. 10: Properties of collaboration and citation graphs of venues



9.2 Characteristics of Computer Science Venues 



The first question we address is to what extent the venues in computer science
are focused and how authors collaborate on that basis. In particular, we compare
the properties of the citation and co-authorship subgraphs of journals and conferences
to identify the differences between the two types of publishing. To gain insights into the
above questions, we proceed as follows: for all venues, we create their collaboration
and citation subgraphs Ga and Gc. Then we compute the four metrics defined in
Section 9.1. For each metric we create a normalized histogram, and by observing
these histograms we are able to examine the characteristics of venues.
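
A normalized histogram of one metric over all venues can be sketched as below. The sketch assumes that normalization means dividing each bin count by the number of venues, so that the bars sum to one; the bin layout, the function name and the metric values are illustrative assumptions.

```python
def normalized_histogram(values, bins=5, low=0.0, high=1.0):
    """Return per-bin fractions of `values` over equal-width bins on [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for x in values:
        index = min(int((x - low) / width), bins - 1)  # x == high goes to the last bin
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * bins

# Invented clustering-coefficient values for eight hypothetical venues.
clustering = [0.05, 0.45, 0.5, 0.55, 0.62, 0.7, 0.9, 1.0]
hist = normalized_histogram(clustering)
print(hist)
```

Because the bars are fractions rather than raw counts, histograms of journals and conferences remain comparable even when the two groups contain different numbers of venues.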

The normalized histograms of the four metrics are given in Fig. 10. Firstly, most
of the venues are not narrow, but are indeed interdisciplinary (shown by the low
density and medium clustering coefficient of citation subgraphs in Fig. 10a and Fig.
10b). However, venues also tend to develop a main theme, with a main focus
and closely related topics as the core topics. That is shown by the big largest connected
component in the citation subgraphs (Fig. 10d). According to the scheme in Table
4, most of the venues fall into network type 3, characterized by low/medium
density, medium clustering coefficient, low/medium maximum betweenness and
a big largest connected component.



We now consider the collaborative behavior of researchers in the venues. In
Fig. 10 we can see that most of the co-authorship subgraphs are of network type
2 (low/medium density, high clustering coefficient, low maximum betweenness and
medium largest connected component). That means researchers in the venues are
clustered in disconnected working groups. The relatively small number of venues that
have a big largest connected component (Fig. 10d) implies that though venues tend
to develop a main theme, not many of them successfully stimulate authors
to collaborate on that theme. Low maximum betweenness (Fig. 10c) suggests that
gateways who connect several working groups rarely exist in the venues. We
will investigate the relation between the existence of such gateways and the impact
of venues in the next section.

Now we compare the properties of the citation and co-authorship subgraphs of
conferences and journals. The question we try to address is whether conferences
exhibit the same pattern in citation and collaborative behavior as journals.
Fig. 11 and Fig. 12 show the comparison of network properties of citation and
co-authorship subgraphs of journals and conferences. In general, most journal
and conference citation subgraphs are of network type 3 (low/medium density,
medium clustering coefficient, medium maximum betweenness and big largest
connected component). However, the clustering coefficient of conferences' citation graphs
is higher than that of journals and the maximum betweenness is lower. That means
citations of conferences are placed in more disconnected clusters, which suggests
that conferences are less focused than journals. A close look at Fig. 12 reveals
some differences in collaborative behavior. The clustering coefficient and maximum
betweenness of conferences' co-authorship subgraphs are higher than journals',
meaning that there exist more gateways in conferences than in journals and researchers
in conferences tend to collaborate with peers in other working groups.

To summarize, venues in computer science are indeed interdisciplinary. Most of 
them established a core area while still connecting to other related areas. Journals 
are more focused than conferences, but conferences facilitate the communication 
between participants whose collaborations tend to cross different communities. 

9.3 Venues Subgraph and the PageRank 

Now we investigate the relation between the ranking of venues and the properties
of collaboration and citation subgraphs. Our interest is whether the properties of
collaboration and citation subgraphs reflect the impact of venues. In Fig. 13 we
plot the median of the network properties for each PageRank value in order to
analyze this relation.
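
The aggregation behind such a plot can be sketched as follows. Since the exact grouping is not specified here, the sketch assumes that venues are grouped into PageRank bins of fixed width and that the median of a metric is taken per bin; the bin width, the field names pagerank and lcc, and the values are hypothetical.

```python
from statistics import median

def median_by_rank_bin(venues, metric, bin_width=0.5):
    """Group venues into PageRank bins and return the median `metric` per bin."""
    groups = {}
    for venue in venues:
        # The bin is identified by its lower bound.
        bin_start = int(venue["pagerank"] / bin_width) * bin_width
        groups.setdefault(bin_start, []).append(venue[metric])
    return {start: median(vals) for start, vals in sorted(groups.items())}

# Invented records: "lcc" stands for the largest connected component fraction (M4).
venues = [
    {"pagerank": 0.2, "lcc": 0.125},
    {"pagerank": 0.3, "lcc": 0.375},
    {"pagerank": 0.4, "lcc": 0.25},
    {"pagerank": 1.2, "lcc": 0.5},
    {"pagerank": 1.4, "lcc": 0.75},
]
medians = median_by_rank_bin(venues, "lcc")
print(medians)
```

Using the median rather than the mean keeps a single unusually well-connected venue from dominating the value of its bin.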

Several observations can be made here. On the one hand, the citation networks of
highly-ranked venues are of network type 3 (low/medium density, medium clustering
coefficient, low/medium betweenness and big largest connected component),
meaning that highly-ranked venues are focused. The co-authorship networks of
highly-ranked venues fall into network type 4, characterized by medium clustering
coefficient, very high maximum betweenness and very big largest connected
component. The vast majority of authors in those venues are connected in a large
component and that component is connected to many other small groups via
gatekeepers. On the other hand, it is not easy to identify the type of the citation
subgraph of low-ranked venues. They might lie between network type 2 and type 3,



[Panels: (a) Density, (b) Clustering Coefficient, (c) Maximum Betweenness, (d) Largest Connected Component; legend: Journal, Conference]

Fig. 11: A comparison of network properties of citation subgraph of journals and
conferences



with high/medium clustering coefficient, low/medium maximum betweenness and
low/medium largest connected component. However, the co-authorship subgraphs of
low-ranked venues are clearly of type 2, where authors are clustered in disconnected
groups.



To summarize, highly-ranked venues are focused as they develop the main
topics as the core and successfully motivate authors to collaborate on these topics.
In these venues, there exist key members who connect different subgroups to the
core. They serve as a gate through which new ideas join the main theme of the venue.
This is very important for every community of practice since one of the key success
factors is not only to retain the well-developed ideas but also to attract people who
bring new ideas to the community [19][18][20]. Although low-ranked venues might
also develop a main theme, they mostly do not succeed in building up a large
community to work on it, or they are still in the early phase of developing their
community.
Fig. 12: A comparison of network properties of co-authorship subgraph of journals
and conferences



10 Conclusions 

In this paper, we presented our study on the knowledge network of computer science.
Based on the combined DBLP and CiteSeerX databases, the knowledge network
is generated using both journal and conference publications. The visualizations
show the cluster structure of the computer science knowledge network, which
an analysis of journals alone cannot reveal. Venues of the same or related fields
are grouped into clusters which can be defined as disciplines or sub-disciplines. We
analyze the development of computer science disciplines by visualizing the knowledge
network at different time points. One important conclusion is that conferences
constitute the social structures that shape the computer science knowledge and
the field is becoming more interdisciplinary as sub-disciplines are connected to
many others.

We analyze the citation and collaboration subgraphs of venues using different
SNA metrics. We find that venues are interdisciplinary and that they develop their
core topics as the main focus. By comparing the citation and collaboration subgraphs
of journals and conferences, our study shows that though journals are more focused
than conferences, the latter facilitate the communication between researchers. We
further analyze the relation between the impact and the properties of the citation and
Fig. 13: Properties of collaboration and citation graphs of venues as a function of
PageRank



collaboration subgraphs of venues. One important conclusion is that highly ranked
venues successfully develop their theme as well as their community, and experts
are the key success factor for the development of a venue. That confirms one of
the principles for cultivating scientific communities of practice studied by several
researchers.

In the future, more digital libraries need to be integrated to obtain complete
citation information. Given that the objective of this paper is to study the macro
structure of computer science, DBLP and CiteSeerX are quite sufficient. However, to
study the structure of the knowledge network at a more detailed and local level (i.e., at the
sub-discipline level), more citation data and venue proceedings are needed. Several
datasets are possible, e.g., ACM, IEEE Xplore, Microsoft Academic Research, and
CEUR-WS.org^. Citation information could also be gathered from search engines
like Google Scholar. Furthermore, the ranking studied in this paper is a global ranking.
It probably does not reflect the complete importance of a venue in a particular
field, especially in some more marginal disciplines such as computer graphics or
bioinformatics. Therefore a deep analysis and ranking at the sub-discipline level is



^ http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/



necessary to gain an insight into a particular domain and to have a full evaluation 
of the impact of venues. 